❓❓ Questions for you
Imagine you are tasked with developing a recommender system for YouTube. You possess data on which users clicked on which videos. After spending considerable time building a recommender system using this data, you realize it isn’t producing high-quality recommendations. What could be the reasons for this?
- Clicks \(\neq\) True preferences
- No information about watch time
- Clicks can be gamed
Think beyond what’s given to you
Questions you have to consider:
- Who is the decision maker?
- What are their objectives?
- What are their alternatives?
- What is their context?
- What data do I need?
Decisions involve a few key pieces
- The decision variable: the variable that is manipulated through the decision.
  - E.g. how much should I sell my house for? (numeric)
- The decision-maker’s objectives: the variables that the decision-maker ultimately cares about.
  - E.g. my total profit, time to sale, etc.
- The context: the variables that mediate the relationship between the decision variable and the objectives.
  - E.g. the housing market, cost of marketing it, my timeline, etc.
Poor vs. Effective communication
Imagine that you’re explaining your work to your manager, who is not very technical. Which one is poorer and which one is more effective? Why?
Communication 1
“I built models to predict next week’s avocado prices. The ridge model had an RMSE of 0.79, but the random forest performed better with tuned hyperparameters. The cross-validation score improved after adding lag features. We should use the random forest.”
Communication 2
“Our avocado price forecast reduces weekly price uncertainty by 15%. This lets the procurement team lock in contracts earlier and avoid overpaying during high-volatility weeks, saving an estimated $45k per month. We need 2 days to automate data updates and a weekly accuracy review.
Risk: model performance drops during holiday spikes. Here’s our mitigation plan.”
Poor vs. Effective communication
❌ Poor communication:
“I built a model to predict next week’s avocado prices. The ridge model had an RMSE of 0.79, but the random forest performed better with tuned hyperparameters. The cross-validation score improved after adding lag features. We should use the random forest.”
Result: The manager doesn’t know why this matters, how it affects decisions, or what to do next. No adoption. 😢
✅ Effective reframe:
“Our avocado price forecast reduces weekly price uncertainty by 15%. This lets the procurement team lock in contracts earlier and avoid overpaying during high-volatility weeks, saving an estimated $45k per month. We need 2 days to automate data updates and a weekly accuracy review.
Risk: model performance drops during holiday spikes. Here’s our mitigation plan.”
Result: Clear value, operational impact, required effort, and risks. Enables decision-making!!
Key difference: Shift from model-centric communication → decision-ready communication.
Confidence and predict_proba
- What does it mean to be “confident” in your results?
- When you perform analysis, you are responsible for many judgment calls.
- Your results will differ from other people’s.
- As you make these judgments and start to form conclusions, how can you recognize your own uncertainties about the data so that you can communicate confidently?
Let’s imagine that the following claim is true:
Vancouver has the highest cost of living of all cities in Canada.
Now let’s consider a few beliefs we could hold:
- Vancouver has the highest cost of living of all cities in Canada. I am 95% sure of this.
- Vancouver has the highest cost of living of all cities in Canada. I am 55% sure of this.
The part in bold (the stated degree of certainty, such as “I am 95% sure of this”) is called a credence. Which belief is better?
But what if it’s actually Toronto that has the highest cost of living in Canada?
- Vancouver has the highest cost of living of all cities in Canada. I am 95% sure of this.
- Vancouver has the highest cost of living of all cities in Canada. I am 55% sure of this.
Which belief is better now?
We don’t just want to be right. We want to be confident when we’re right and hesitant when we’re wrong.
Loss in machine learning
When you call fit for LogisticRegression it has similar preferences:
correct and confident
> correct and hesitant
> incorrect and hesitant
> incorrect and confident
- This is a “loss” or “error” function like mean squared error, so lower values are better.
- When you call fit it tries to minimize this metric.
Logistic regression loss
- confident and correct \(\rightarrow\) smaller loss
- hesitant and correct \(\rightarrow\) a bit higher loss
- hesitant and incorrect \(\rightarrow\) even higher loss
- confident and incorrect \(\rightarrow\) high loss
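The ordering above can be checked numerically. The following is a minimal sketch, assuming the loss in question is the log loss (cross-entropy), which is the data-fit term that scikit-learn's LogisticRegression minimizes (alongside a regularization penalty by default). The helper `log_loss_single` and the credences 0.95, 0.55, 0.45 and 0.05 are illustrative choices, not part of any library.

```python
import math


def log_loss_single(y_true: int, p_positive: float) -> float:
    """Log loss for one example, given the predicted probability of class 1."""
    # Probability the model assigned to the class that turned out to be true.
    p_correct = p_positive if y_true == 1 else 1.0 - p_positive
    return -math.log(p_correct)


# Hypothetical predicted probabilities of class 1 for an example
# whose true label is 1 (illustrative values only).
cases = {
    "confident and correct": 0.95,
    "hesitant and correct": 0.55,
    "hesitant and incorrect": 0.45,
    "confident and incorrect": 0.05,
}

losses = {name: log_loss_single(1, p) for name, p in cases.items()}

for name, loss in losses.items():
    print(f"{name:>24}: {loss:.3f}")
```

With these values the loss grows from one case to the next, and the jump to "confident and incorrect" is by far the largest: the log loss penalizes confident mistakes much more heavily than hesitant ones.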
In our final exam, imagine if, along with your answers, we ask you to also provide a confidence score for each. This would involve rating how sure you are about each answer, perhaps on a percentage scale from 0% (completely unsure) to 100% (completely sure). This method not only assesses your knowledge but also your awareness of your own understanding, potentially impacting the grading process and highlighting areas for improvement. Who supports this idea 😉?
Misleading visualizations
This chart is attempting to suggest a relationship between childhood MMR vaccination rates and the prevalence of autism spectrum disorders (AD/ASD) across several countries.
Do you see any problems with this visualization?